The focus of project one for team Wasted Potential was to develop and answer a research-driven SMART question on a health-related dataset. Our SMART question for project one was to determine which of the given variables in our dataset exhibit a measurable effect on the life expectancy of different countries across the world. Through the course of this paper, we will explain what we know about this dataset, the limitations of our dataset, how information was gathered, what preprocessing steps and analysis were completed prior to our work on the dataset, the exploratory data analysis (EDA) we performed, and how our SMART question was answered.
The dataset Wasted Potential chose was “Life Expectancy (WHO): Statistical Analysis on factors influencing Life Expectancy.” Although we found this dataset on Kaggle, it was originally sourced from the Global Health Observatory (GHO), which is a data repository owned by the World Health Organization (WHO). While the original dataset from GHO contains many more variables and observations, the cleaned Kaggle dataset that we used for our project contains 2,938 observations spread over 22 variables that contribute to assessing the overall health status of 193 countries. Kumar Rajarshi, the author of the Kaggle dataset, noted several preprocessing and analysis steps taken to polish the original dataset from GHO. The original dataset showed some missing values, which were handled in R software using the Missmap command. Missmap indicated that most of the missing data was for population, Hepatitis B and GDP in lesser-known countries like Vanuatu, Tonga, Togo, and Cabo Verde. Kumar explained that finding year-specific data for these countries was very difficult; the lesser-known countries were ultimately excluded from the dataset. Additionally, Kumar noted that of the many variables in the original dataset from GHO, only the most critical health-related factors were chosen to be included in their Kaggle dataset. Thus, the final set of variables in this dataset ranges from health factors such as BMI and immunizations to population factors such as developing status and GDP (Gross Domestic Product per capita). An explanation of the years this dataset covers is as follows: “It has been observed that in the past 15 years, there has been a huge development in health sector resulting in improvement of human mortality rates especially in the developing nations in comparison to the past 30 years. Therefore, in this project we have considered data from year 2000-2015 for 193 countries for further analysis” (Kumar Rajarshi, 2018).
Thus, our dataset only encompasses health data in the years 2000-2015. Kumar noted that the economic data for this dataset (i.e. GDP, total expenditure) was sourced from the United Nations website.
Because life expectancy is our response variable and the target of our SMART question, it is important to explain what life expectancy is and how it is recorded in this dataset. Life expectancy is a summary of the overall mortality level of a given population. The reported life expectancy reflects the mortality pattern that exists across all age groups in a current year. For example, in 2016, the global life expectancy was 72 years old, with female values reporting roughly 4 years longer than males. What was particularly interesting about this statistic was the wide range of values that made up the global life expectancy and how they differed by location. In 2016, the reported values ranged from 61.2 years old in the WHO African region to 77.5 years old in the WHO European Region (taken from statistics on who.int site); this gap is over 16 years in size! These values highlight the huge inequality in how health is distributed around the world. For our project, we utilized a combination of descriptive graphs and tests to explore our data and variables. Our EDA will be useful in identifying key factors that can aid in building models to predict future life expectancy values. Additionally, as explained in our presentation and later in this paper, there were a variety of missing and likely incorrect values for different countries. While the dataset on Kaggle excludes data for lesser-known countries like Tonga, Togo and Cabo Verde, there are still incorrect or missing values for multiple countries. For example, the population of Afghanistan in 2014 was recorded as 327,582, while online sources record the population in that year to be 32.76 million. The inaccuracy and absence of a variety of values, combined with the limited range of years (2000-2015), comprise the limitations of our dataset.
The dataset of choice for Project One was originally constructed by the World Health Organization (WHO) to track the health status of 193 countries across the world. The question we sought to answer is as follows:
SMART Question: What factors affect life expectancy in individuals across the world?
The dataset includes a variety of factors that contribute to the overall health status of a country. We wanted to know what effect these factors had specifically on life expectancy. An explanation of the different factors in this dataset is as follows:
• Country: Country
• Year: Year
• Status: Developed or Developing status
• Life expectancy: Life Expectancy in age
• Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
• Infant deaths: Number of Infant Deaths per 1000 population
• Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
• Percentage expenditure: Expenditure on health as a percentage of GDP per capita(%)
• Hepatitis B: HepB immunization coverage among 1-year-olds (%)
• Measles: Number of reported measles cases per 1000 population
• BMI: Average Body Mass Index of entire population
• Under-five deaths: Number of under-five deaths per 1000 population
• Polio: Pol3 immunization coverage among 1-year-olds (%)
• Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
• Diphtheria: DTP3 immunization coverage among 1-year-olds (%)
• HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
• GDP: Gross Domestic Product per capita (in USD)
• Population: Population of the country
• Thinness 10-19 years: Prevalence of thinness among children/adolescents, age 10 - 19(%)
• Thinness 5-9 years: Prevalence of thinness among children, age 5 to 9 (%)
• Income comp of resources: HDI in terms of income composition of resources (index ranging from 0 to 1)
• Schooling: Number of years of Schooling (years)
Data Source: Kaggle
The relationship between ‘Adult Mortality’ and ‘Life Expectancy’ was ignored for the duration of this EDA because little new information would be gained from exploring it further: it is expected that as the adult mortality rate increases, life expectancy decreases. The other variables, however, are explored in further detail.
As we were beginning our analysis of the dataset, we noticed some unusual values in some graphs and statistics. For instance, the “percentage expenditure” variable reflects the money spent on health as a percentage of GDP per capita; as such, the highest possible value in this context is 100%. However, one of the data points showed 19480 as the percentage expenditure for Switzerland, which did not make sense. There was also a data point that showed China’s percentage expenditure as zero, which we also know is not true. In other instances, such as population, we noticed values in the thousands when they should have been in the millions. We believe that some of the discrepancies in the dataset were perhaps due to the owners entering values by hand or improperly joining the data from the two different sources (the WHO and the UN). We ultimately decided not to attempt to remove these values, as we would have to identify the discrepancies manually, and since we are not creating linear models to make predictions, it is not necessary at this point. For future analysis, we would recommend cleaning the dataset of these incorrect values before proceeding with creating models.
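As a rough illustration of how such values could be flagged before any modeling, the sketch below applies a simple plausibility rule to a percentage column. It is only a sketch under stated assumptions: the toy data frame, the third (made-up) row, and the helper name `flag_implausible` are hypothetical, and the rule "a percentage must lie in (0, 100]" is our reading of the variable definition rather than a documented cleaning step.

```r
# Toy rows: the Switzerland and China values are the ones quoted in the text;
# "Exampleland" is a made-up plausible value added for contrast.
toy <- data.frame(
  Country = c("Switzerland", "China", "Exampleland"),
  percentage.expenditure = c(19480, 0, 6.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a percentage is missing or outside (0, 100].
flag_implausible <- function(pct) {
  is.na(pct) | pct <= 0 | pct > 100
}

toy$suspect <- flag_implausible(toy$percentage.expenditure)

# Show only the rows that would need manual review.
print(toy[toy$suspect, ])
```

Flagging rather than deleting keeps the decision about each suspicious row with the analyst.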
The histogram below presents the distribution of Life Expectancy values in this dataset. It shows that the youngest values start at 36.3 and go all the way up to 89 years old as the oldest expected ages. The mean life expectancy is 69.2 years old and the median life expectancy is 72.1 years old. These values are also indicated in the left skewed histogram below. Typically, in a left skewed distribution, most of the data points are on the right side of the histogram and there are some smaller values towards the left. This is reflected in the histogram for our data as well. The smaller values in the left skew lower the value of the mean which results in a mean smaller than the median (versus a symmetric, normal distribution in which the mean and median values would be even closer together).
life2 <- na.omit(life) #omit missing values
loadPkg("ggplot2")
ggplot(life, aes(x=Life.expectancy) )+
geom_histogram(color="darkblue",fill="lightblue")+
ggtitle("Life Expectancy Histogram")+
xlab("Life Expectancy (Age)")
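The mean-versus-median reasoning above can be checked numerically. The following is a minimal sketch on made-up values (not taken from the dataset); the `skewness` helper is a hypothetical name for the usual moment-based sample skewness, where a negative result indicates a left skew.

```r
# Toy left-skewed sample (made-up values, not from the dataset).
x <- c(40, 55, 62, 68, 70, 72, 73, 74, 75, 76)

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# In a left-skewed sample the mean sits below the median.
cat("mean:", mean(x), " median:", median(x), " skewness:", skewness(x), "\n")
```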
The QQ-plot below also shows a fairly normal distribution that follows the straight line. Yet we do see a clearer deviation of the points from the line towards the upper right portion of the graph. Because the points curve away from the line, this further supports the conclusion that the data is left skewed, which agrees with the takeaway from the earlier histogram.
qqnorm(life$Life.expectancy, main="Life Expectancy Q-Q Plot", ylab="Life Expectancy (Age)")
qqline(life$Life.expectancy)
#ggplot(life2, aes(x=Total.expenditure, y=Life.expectancy, fill=Total.expenditure, group = 1)) + geom_boxplot() + scale_fill_brewer(palette="Spectral") + ggtitle("Life Expectancy vs. Total Expenditure") + ylab("Life Expectancy") + xlab("Total Expenditure ($)")
#sapply(life, mean, na.rm=TRUE) # excluding missing values
#sapply(life, sd)
summary(life)
#below is yuki's table
The following table shows summary statistics for selected variables. For example, life expectancy ranges from 36.3 to 89 years, and total government expenditure on health ranges from 0.37% to 17.6%. As mentioned before, however, there are data problems for GDP and Percentage Expenditure: the minimum value for GDP is 2, which does not make sense, and the maximum and mean values for Percentage Expenditure are 19480 and 738 respectively, which cannot be true for a variable measured in percent.
| | Life Expectancy (Years) | Total Gov Expenditure (%) | GDP (USD/Capita) | Percentage Expenditure (%) | HIV/AIDS (Death/1000) | Schooling (Years) |
|---|---|---|---|---|---|---|
| Mean | 69.2 | 5.94 | 7483 | 738 | 1.7 | 12 |
| Median | 72.1 | 5.75 | 1767 | 65 | 0.1 | 12.3 |
| S.D. | 9.52 | 2.5 | 14270 | 1988 | 5.08 | 3.36 |
| Range | 36.3 - 89.0 | 0.37 - 17.6 | 2 - 119173 | 0 - 19480 | 0.1 - 50.6 | 0 - 20.7 |
To determine the relationship between numerical variables and ‘Life Expectancy’, a correlation was performed. The variables “Country” and “Status” were removed prior to performing the correlation, as they are factor variables and will be addressed later. Additionally, rows that were blank or had “NA” in them were ignored, as array sizes must be equal to perform a correlation between variables.
library(dplyr)
sapply(life, class) #look at the class of each variables
life_nofactor = select(life, -c(Country, Status)) #remove the factor variables
cor_life=cor(life_nofactor,use = "complete.obs") #create correlation matrix
library(corrplot)
corrplot(cor_life, type="lower", tl.pos = "l") #plot correlation matrix
The resulting correlation matrix can be viewed above. It was found that the five variables most strongly correlated with ‘Life Expectancy’ were ‘Adult Mortality’, ‘Income Composition of Resources’, ‘Schooling’, ‘HIV/AIDS’, and ‘BMI’. The correlation coefficients for these variables were -0.70, 0.73, 0.75, -0.56, and 0.57, respectively. The negative sign on the correlations for ‘Adult Mortality’ and ‘HIV/AIDS’ indicates that as life expectancy increases, adult mortality and the incidence of AIDS decrease. It is important to note that the variable ‘Population’ had virtually no correlation with ‘Life Expectancy’, as it had a correlation coefficient of -0.02.
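Reading the strongest correlations off a plot can be error-prone, so one option is to pull the ‘Life Expectancy’ row out of the correlation matrix and sort it by absolute value. The sketch below does this on a made-up data frame standing in for `life_nofactor`; the values are invented for illustration and do not reproduce the coefficients reported above.

```r
# Toy numeric data (made-up values) standing in for life_nofactor.
toy <- data.frame(
  Life.expectancy = c(55, 60, 65, 70, 75, 80),
  Schooling       = c(6, 8, 9, 12, 13, 16),
  Adult.Mortality = c(400, 310, 280, 200, 120, 60),
  Population      = c(5, 1, 9, 2, 8, 3)
)

cor_toy <- cor(toy, use = "complete.obs")

# Correlations with the response, excluding its self-correlation,
# ordered from strongest to weakest in absolute value.
r <- cor_toy["Life.expectancy", colnames(cor_toy) != "Life.expectancy"]
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting on the absolute value keeps strong negative correlations, such as adult mortality, near the top of the list.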
Two interesting variables we wanted to take a closer look at in our initial exploration of this dataset involved the relationship between spending money and life expectancy. There were two variables, percentage expenditure and total expenditure, and it is important to highlight their differences in order to clarify the perspectives they may provide. Percentage expenditure refers to the expenditure on health as a percentage of GDP (Gross Domestic Product) per capita for the given year, whereas total expenditure refers to the government’s expenditure on healthcare as a percentage of the total government spending for that given year. We initially thought that if we looked at total expenditure, we would see a clear trend, as it would make sense that more government spending on healthcare should translate to better overall health outcomes and thus higher life expectancy. But as the scatterplot “Life Expectancy (y) vs. Govt Healthcare Expenditure(x)” below shows, there is no discernible positive or negative relationship between the data points.
plot(life$Total.expenditure, life$Life.expectancy, main=" Life Expectancy (y) vs. Govt Healthcare Expenditure(x) ", xlab="% Healthcare Expenditure (out of total Govt spending)", ylab="Life Expectancy (Age)", pch=19) #base R scatterplot; base graphics calls are not chained with "+"
abline(lm(life$Life.expectancy~life$Total.expenditure), col="red") # regression line (y~x)
#ggplot(life, aes(x=factor(Year), y=Life.expectancy))+
# geom_boxplot() +
# facet_wrap(~Status) +
# theme(axis.text.x = element_text(angle = 90)) +
# ylab("Life Expectancy (Age)")
The scatterplot below, “Per Capita Health Expenditure (y) vs GDP (x)”, shows a positive relationship between GDP per capita and the amount of money spent on healthcare as a % of GDP per capita. This is more in line with our hypothesis that more money spent on healthcare should result in better health outcomes and higher life expectancy. However, a cluster of data points in the left corner of the graph indicates many unusual observations with GDP close to zero, for example for the Philippines and China. We can also see unusual percentage expenditure values approaching 20000, which do not make sense for a percentage; Switzerland is one example.
It seems that the difference between the graph below and the graph above indicates that different amounts of government spending do not have a great effect, but that the overall amount citizens are willing to spend is more indicative of greater life expectancy. One hypothesis is that no matter how much the government is willing to spend on its citizens, each person will do their best to supplement that amount with sufficient funding to meet their healthcare needs. This could be studied further with more quantitative spending data instead of just percentages.
ggplot(life, aes(x=GDP, y=percentage.expenditure)) +
geom_point() +
geom_smooth() +
ylab("Expenditure on health as a % of GDP per capita(%)") +
xlab("Gross Domestic Product (GDP per capita in USD)") +
ggtitle("Per Capita Health Expenditure (y) vs GDP (x)")
The scatterplot below, “Life Expectancy (y) vs Deaths due to HIV/AIDS (x)”, shows a negative trend: life expectancy falls as deaths due to HIV/AIDS increase, which is to be expected.
ggplot(life, aes(x=HIV.AIDS, y=Life.expectancy)) +
geom_point() +
geom_smooth() +
ylab("Life Expectancy (Age)") +
xlab("Deaths due to HIV/AIDS per 1000 live births (for 0-4 years)")+
ggtitle("Life Expectancy (y) vs Deaths due to HIV/AIDS (x)")
The scatterplot below, “Life Expectancy (y) vs Years of Schooling (x)”, shows a slightly positive trend in life expectancy versus years of schooling. One possible explanation is that as years of schooling increase, people’s level of education rises and they can attain a better quality of life, which ultimately raises their life expectancy.
ggplot(life, aes(x=Schooling, y=Life.expectancy)) +
geom_point() +
geom_smooth() +
ylab("Life Expectancy (Age)") +
xlab("Years of Schooling")+
ggtitle("Life Expectancy (y) vs Years of Schooling (x)")
Earlier, a correlation was performed to determine whether a relationship existed between ‘life expectancy’ and the numerical variables. Here, the factor variable ‘Status’ will be explored to determine whether the Status of a country results in differing life expectancies. The factor variables that were removed earlier were ‘Country’ and ‘Status’. The ‘Status’ was that of countries that were ‘Developed’ and ‘Developing’. To determine whether a country’s life expectancy was different for ‘Developed’ versus ‘Developing’ countries, a t-test was performed. It was found that the average life expectancy in the developed countries was 78.7 years of age and that in the developing countries was 67.7 years. The t-test results found the mean life expectancy to be significantly different, with a p-value of less than 2e-16.
#summary(life)
life_developed=na.omit(subset(life, Status=='Developed')) #subset of countries that are Developed
life_developing=na.omit(subset(life, Status=='Developing')) #subset of countries that are Developing
#Life expectancy of Developed/Developing are different as p<<0.05
mean_life_developed = mean(life_developed$Life.expectancy)
mean_life_developing = mean(life_developing$Life.expectancy)
t_life_developed = t.test(x=life_developed$Life.expectancy, conf.level=0.95 )
t_life_developing= t.test(x=life_developing$Life.expectancy, conf.level=0.95 )
t_life_developed$conf.int
t_life_developing$conf.int
mean_life_developed
mean_life_developing
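ttest_status = t.test(life_developed$Life.expectancy, life_developing$Life.expectancy) #two-sample t-test of Developed vs Developing
ttest_status #print the test result, including the p-value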
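For readers less familiar with `t.test`, the sketch below runs the same kind of two-sample comparison on made-up values and shows how the group means, p-value, and confidence interval can be extracted from the result. With two samples, `t.test` defaults to Welch's unequal-variance test (`var.equal = FALSE`). The toy data frame is an assumption for illustration and does not reproduce the figures reported above.

```r
# Toy life expectancies (made-up values) for the two Status groups.
toy <- data.frame(
  Status = rep(c("Developed", "Developing"), each = 6),
  Life.expectancy = c(78, 79, 80, 77, 81, 79,
                      62, 70, 65, 68, 59, 72)
)

# Formula interface; var.equal = FALSE (the default) gives Welch's test.
res <- t.test(Life.expectancy ~ Status, data = toy, var.equal = FALSE)

res$estimate   # mean of each group
res$p.value    # two-sided p-value for the difference in means
res$conf.int   # 95% confidence interval for the difference in means
```

The formula interface avoids creating separate subsets for each group, which can be convenient when many variables are compared.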
To visually discern the difference in life expectancies, histograms (one showing actual counts and the other showing proportions) were generated. As can be seen, the life expectancies in developed countries are all grouped toward the higher (right-hand) side of the histogram, while those in developing countries are, on average, much lower.
#make overlaying histograms of the two subgroups (Developing in red, Developed in blue)
# First distribution
hist(life_developing$Life.expectancy, col=rgb(1,0,0,0.5), xlab="Life Expectancy",
ylab="Count", main="Life Expectancy of Developed vs Developing Countries-Actual Count" )
# Second with add=T to plot on top
hist(life_developed$Life.expectancy, col=rgb(0,0,1,0.5), add=T)
# Add legend
legend("topright", legend=c("Developing","Developed"), col=c(rgb(1,0,0,0.5),
rgb(0,0,1,0.5)), pt.cex=2, pch=15 )
#Histogram Plot of Developed vs Developing using relative frequency
#make overlaying histograms of the two subgroups (Developing in red, Developed in blue)
# First distribution
hist(life_developing$Life.expectancy, col=rgb(1,0,0,0.5), xlab="Life Expectancy",
ylab="Proportion", main="Life Expectancy of Developed vs Developing Countries-Relative Frequencies" , freq=F,
ylim = c(0,0.15))
# Second with add=T to plot on top
hist(life_developed$Life.expectancy, col=rgb(0,0,1,0.5), add=T, freq=F)
# Add legend
legend("topright", legend=c("Developing","Developed"), col=c(rgb(1,0,0,0.5),
rgb(0,0,1,0.5)), pt.cex=2, pch=15 )
The first histogram shows the actual count of observations falling in each bin of life expectancy. As can be seen, there are far more observations for developing countries than for developed ones. The second histogram shows the proportion of each group (developed vs. developing) falling in each life expectancy bin. Both graphs indicate that the developed countries have a higher life expectancy than the developing countries. This further illustrates the difference in average life expectancy between developing and developed countries.
#t-test of developed vs developing of variables found to have significantly high correlations with Life Expectancy
ttest_income = t.test(life_developed$Income.composition.of.resources, life_developing$Income.composition.of.resources)
ttest_school = t.test(life_developed$Schooling, life_developing$Schooling)
ttest_BMI = t.test(life_developed$BMI, life_developing$BMI)
ttest_AIDS = t.test(life_developed$HIV.AIDS, life_developing$HIV.AIDS)
ttest_GDP = t.test(life_developed$GDP, life_developing$GDP)
ttest_Alcohol = t.test(life_developed$Alcohol, life_developing$Alcohol)
ttest_expend = t.test(life_developed$percentage.expenditure, life_developing$percentage.expenditure)
ttest_thin19 = t.test(life_developed$thinness..1.19.years, life_developing$thinness..1.19.years)
ttest_thin5 = t.test(life_developed$thinness.5.9.years, life_developing$thinness.5.9.years)
ttest_polio = t.test(life_developed$Polio, life_developing$Polio)
ttest_dip = t.test(life_developed$Diphtheria, life_developing$Diphtheria)
ttest_infantdeath = t.test(life_developed$infant.deaths, life_developing$infant.deaths)
ttest_hep = t.test(life_developed$Hepatitis.B, life_developing$Hepatitis.B)
ttest_measles = t.test(life_developed$Measles, life_developing$Measles)
ttest_under5 = t.test(life_developed$under.five.deaths, life_developing$under.five.deaths)
ttest_totalespend = t.test(life_developed$Total.expenditure, life_developing$Total.expenditure)
ttest_income
ttest_school
ttest_BMI
ttest_AIDS
ttest_GDP
ttest_Alcohol
ttest_expend
ttest_thin19
ttest_thin5
ttest_polio
ttest_dip
ttest_infantdeath
ttest_hep
ttest_measles
ttest_under5
ttest_totalespend
For each variable that was found to be significantly correlated (p-value less than 0.05) with life expectancy, the difference in means was tested between the developed and developing countries. As stated previously, all numerical variables except for population size were found to be significantly correlated with life expectancy. Similarly, the means of each of these variables were found to be significantly different as well. The results of the t-tests can be found in the table below. The variable ‘Year’ was not tested, as data was collected for the same years for both the developing and developed countries. Thus, the average year of data collection would be the same and no insight would be gained from this.
#Correlation of variables table
ttest_table = matrix(c("0.836", "0.596", "<2e-16", "15.6", "11.5", "<2e-16", "52.3", "35.7", "<2e-16", "0.10", "2.31", "<2e-16", "18977", "3259", "<2e-16", "10.44", "3.52", "<2e-16", "2657", "362", "<2e-16", "1.44", "5.44", "<2e-16", "1.46", "5.50", "<2e-16", "94.5", "81.7", "<2e-16", "94.6", "82.4", "<2e-16", "0.872", "38.002", "<2e-16", "87.9", "77.7", "8e-12", "475", "2525", "1e-10", "1.09", "51.64", "<2e-16", "7.02", "5.77", "2e-11"), ncol=3, byrow=TRUE)
colnames(ttest_table)=c("Mean Value Developed", "Mean Value Developing", "P-value")
rownames(ttest_table)=c("Income level, from 0 to 1", "Years of Schooling", "BMI", "AIDS, per 1000 people", "Gross Domestic Product per Capita", "Alcohol in liters/capita", "% Government Expenditure", "Thinness 10-19 yo, % prevalence", "Thinness 5-9 yo, % prevalence", "Polio, % immunization coverage for 1 yo", "Diphtheria, % immunization coverage for 1 yo", "Infant Deaths, per 1000 people", "Hepatitis B, % immunization coverage for 1 yo", "Measles, per 1000 people", "Under Age 5 Deaths, per 1000 people", "Total Government Expenditure")
library(knitr)
kable(ttest_table, caption="T-test Between Developed and Developing Countries for Variables Highly Correlated to Life Expectancy")
| | Mean Value Developed | Mean Value Developing | P-value |
|---|---|---|---|
| Income level, from 0 to 1 | 0.836 | 0.596 | <2e-16 |
| Years of Schooling | 15.6 | 11.5 | <2e-16 |
| BMI | 52.3 | 35.7 | <2e-16 |
| AIDS, per 1000 people | 0.10 | 2.31 | <2e-16 |
| Gross Domestic Product per Capita | 18977 | 3259 | <2e-16 |
| Alcohol in liters/capita | 10.44 | 3.52 | <2e-16 |
| % Government Expenditure | 2657 | 362 | <2e-16 |
| Thinness 10-19 yo, % prevalence | 1.44 | 5.44 | <2e-16 |
| Thinness 5-9 yo, % prevalence | 1.46 | 5.50 | <2e-16 |
| Polio, % immunization coverage for 1 yo | 94.5 | 81.7 | <2e-16 |
| Diphtheria, % immunization coverage for 1 yo | 94.6 | 82.4 | <2e-16 |
| Infant Deaths, per 1000 people | 0.872 | 38.002 | <2e-16 |
| Hepatitis B, % immunization coverage for 1 yo | 87.9 | 77.7 | 8e-12 |
| Measles, per 1000 people | 475 | 2525 | 1e-10 |
| Under Age 5 Deaths, per 1000 people | 1.09 | 51.64 | <2e-16 |
| Total Government Expenditure | 7.02 | 5.77 | 2e-11 |
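The table above was assembled by typing in the results of the individual t-tests. As an alternative, a summary of this shape could be built programmatically, which avoids transcription errors. The sketch below is one possible approach on a made-up data frame with only two variables; the data values and the object name `ttest_summary` are assumptions for illustration.

```r
# Toy data (made-up values) standing in for the cleaned dataset.
toy <- data.frame(
  Status    = rep(c("Developed", "Developing"), each = 5),
  Schooling = c(15, 16, 15.5, 16.5, 15, 11, 12, 10.5, 12.5, 11.5),
  BMI       = c(52, 55, 50, 53, 51, 36, 30, 40, 35, 37)
)

vars <- c("Schooling", "BMI")

# One Welch t-test per variable; keep the group means and the p-value.
rows <- lapply(vars, function(v) {
  dev <- toy[toy$Status == "Developed", v]
  dvg <- toy[toy$Status == "Developing", v]
  tt  <- t.test(dev, dvg)
  data.frame(
    Variable        = v,
    Mean.Developed  = mean(dev),
    Mean.Developing = mean(dvg),
    P.value         = signif(tt$p.value, 2)
  )
})

# Stack the per-variable rows into one summary table.
ttest_summary <- do.call(rbind, rows)
print(ttest_summary)
```

The resulting data frame can be passed to `kable` in the same way as the hand-built matrix.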
There are many takeaways from the table above. First are the factors that contribute to the health of individuals in the developed versus developing countries. The immunization levels of the developed countries are higher, by over 10 percentage points, than those of the developing countries for each of the major diseases listed: Polio, Diphtheria, and Hepatitis B. For diseases reported as case or death rates rather than immunization coverage, the rates per 1000 people are much higher in the developing countries, as can be seen for Measles and AIDS. Additionally, the prevalence of thinness in the developing countries is much higher, with the average BMI being lower. These factors likely contribute to the higher death rates for children and infants depicted above. In the developed countries, the level of education is over 4 years longer than in the developing countries, which may lead to the higher level of income seen in the developed countries. It can also be seen that government expenditure, whether total or percentage, is much higher in the developed countries. This could lead one to believe that life expectancy could potentially be extended with higher levels of government aid.
ggplot(life, aes(x=Year, y=Life.expectancy, col=Status))+
geom_point() +
geom_smooth() +
facet_wrap(~Status)+
ggtitle("Life Expectancy (y) vs Year (x) in Developed and Developing Nations")
The graph above shows life expectancy over the years for developed and developing countries. We can clearly see that the average life expectancy for both developed and developing countries increases from 2000 to 2015, but developed countries have a higher life expectancy than developing countries.
However, there is an unusual data point among the developing countries, with a life expectancy of less than 40 years in 2010. From the data, we know that this is Haiti, where the 2010 earthquake caused life expectancy to drop sharply.
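An outlier like this can be traced back to its row by looking up the minimum of the column. The sketch below shows the idea on made-up rows; the country labels and values are placeholders, not records from the dataset.

```r
# Toy rows (made-up values); the NA mimics a missing observation.
toy <- data.frame(
  Country = c("X", "Y", "Z"),
  Year    = c(2009, 2010, 2010),
  Life.expectancy = c(61.5, 38.2, NA),
  stringsAsFactors = FALSE
)

# which.min() skips NA values and returns the index of the smallest entry.
lowest <- toy[which.min(toy$Life.expectancy), ]
print(lowest)
```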
Earlier in this paper, an example statistic was given to illustrate the disparity in life expectancy by location. For the purpose of visualization, we constructed a world map to display the distribution of life expectancies in different countries. Our world map graphic helps visualize the difference in life expectancy between developed and developing countries, as well as how life expectancy changes in countries over time. To create the world map, we used ggplot2’s map_data function. The purpose of map_data is to transform data from the maps package into a data frame so that country data may be plotted. For our world map, we used map_data with the map ‘world’ to store world data in a data frame which we called world_data.
Our preprocessing for the world map consisted of adding levels and changing the names of different countries in our dataset to match what was present in ggplot2’s map_data. In order to plot data on a world map, you must have columns for latitude and longitude, which effectively serve as the axes for plotting. Our dataset did not contain latitude-longitude data, but world_data does. Thus, we used the match function to merge our dataset and world_data together to create a data frame suitable for plotting. Because the match function effectively performs an inner join on the country column, the names of the countries must match in order to preserve data during the merge, and the naming convention used in world_data differed from that of our dataset. For example, in our dataset the United States is titled “United States of America,” but in world_data it is titled “USA.” If we performed the merge as is, all the data for United States of America would be lost because there is no such country in world_data. By adding levels to our dataset to change the country names to match what is present in world_data, we end up with a data frame containing all countries, their data, and their respective latitude-longitude coordinates for plotting.
data <- read.csv("lifeexpectancydata.csv")
data_world_map <- data.frame(data)
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'USA')
data_world_map$Country[data_world_map$Country == 'United States of America'] <- 'USA'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'UK')
data_world_map$Country[data_world_map$Country == 'United Kingdom of Great Britain and Northern Ireland'] <- 'UK'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Russia')
data_world_map$Country[data_world_map$Country == 'Russian Federation'] <- 'Russia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Bolivia')
data_world_map$Country[data_world_map$Country == 'Bolivia (Plurinational State of)'] <- 'Bolivia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Brunei')
data_world_map$Country[data_world_map$Country == 'Brunei Darussalam'] <- 'Brunei'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Czech Republic')
data_world_map$Country[data_world_map$Country == 'Czechia'] <- 'Czech Republic'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'North Korea')
data_world_map$Country[data_world_map$Country == 'Democratic People\'s Republic of Korea'] <- 'North Korea'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Iran')
data_world_map$Country[data_world_map$Country == 'Iran (Islamic Republic of)'] <- 'Iran'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Laos')
data_world_map$Country[data_world_map$Country == 'Lao People\'s Democratic Republic'] <- 'Laos'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Micronesia')
data_world_map$Country[data_world_map$Country == 'Micronesia (Federated States of)'] <- 'Micronesia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'South Korea')
data_world_map$Country[data_world_map$Country == 'Republic of Korea'] <- 'South Korea'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Moldova')
data_world_map$Country[data_world_map$Country == 'Republic of Moldova'] <- 'Moldova'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Saint Vincent')
data_world_map$Country[data_world_map$Country == 'Saint Vincent and the Grenadines'] <- 'Saint Vincent'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Syria')
data_world_map$Country[data_world_map$Country == 'Syrian Arab Republic'] <- 'Syria'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Macedonia')
data_world_map$Country[data_world_map$Country == 'The former Yugoslav republic of Macedonia'] <- 'Macedonia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Tanzania')
data_world_map$Country[data_world_map$Country == 'United Republic of Tanzania'] <- 'Tanzania'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Venezuela')
data_world_map$Country[data_world_map$Country == 'Venezuela (Bolivarian Republic of)'] <- 'Venezuela'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Vietnam')
data_world_map$Country[data_world_map$Country == 'Viet Nam'] <- 'Vietnam'
two_thousand <- subset(x=data_world_map, data_world_map$Year==2000)
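The renaming above repeats the same two statements for every country. A more compact alternative, sketched below, keeps the old and new names in a single named vector and recodes through it. This is only one possible approach: the helper name `recode_country` is hypothetical, and the lookup shown covers just three of the eighteen renames.

```r
# Lookup from dataset names to map names (a subset of the renames above).
rename_map <- c(
  "United States of America" = "USA",
  "Russian Federation"       = "Russia",
  "Viet Nam"                 = "Vietnam"
)

# Hypothetical helper: recode names found in the lookup, keep the rest.
recode_country <- function(x, lookup) {
  x <- as.character(x)            # accepts factor or character input
  hit <- x %in% names(lookup)     # which entries have a replacement
  x[hit] <- unname(lookup[x[hit]])
  x
}

countries <- c("United States of America", "France", "Viet Nam")
print(recode_country(countries, rename_map))
```

Converting to character first sidesteps the need to add factor levels before assigning a new name.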
Once levels were added to account for eighteen different country name mismatches, the world map was constructed using ggplot2. It is important to note that, to visualize the difference between developed and developing countries, the mean life expectancy across all years (2000-2015) was computed for each country and displayed. By using the mean value, we can get a general sense of how life expectancy differs by location. For the graphic that displays trends in life expectancy across the years, every fifth year between 2000 and 2015 was graphed. We printed only four graphs because sixteen separate graphs would make differences between the years harder to see, given how small the year-to-year changes in life expectancy are. The result of this world map visualization is that the gap in average life expectancy between developed and developing countries is about ten to fifteen years, which is astounding. This data reassured us that the status of a country has a profound correlation with its life expectancy. However, the status of a country carries several other factors with it (e.g. GDP, population and disease levels), which are likely the more measurable contributors to the value of life expectancy.
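The per-country mean described above could be computed along the following lines. This is a sketch on made-up data, assuming a base R approach with `aggregate` and `match`; the toy frames stand in for `data_world_map` and `world_data`, and the object names `mean_le` and `toy_map` are hypothetical.

```r
# Toy data (made-up values) standing in for data_world_map.
toy <- data.frame(
  Country = c("A", "A", "B", "B", "B"),
  Year    = c(2000, 2001, 2000, 2001, 2002),
  Life.expectancy = c(70, 72, 60, NA, 64)
)

# Mean life expectancy per country across all years.
# The formula method drops rows with NA before averaging.
mean_le <- aggregate(Life.expectancy ~ Country, data = toy, FUN = mean)

# Toy stand-in for world_data: one row per polygon vertex.
toy_map <- data.frame(
  region = c("A", "A", "B", "C"),
  long   = c(0, 1, 5, 9),
  lat    = c(0, 1, 5, 9)
)

# Attach each region's mean; regions absent from the data get NA.
toy_map$value <- mean_le$Life.expectancy[match(toy_map$region, mean_le$Country)]
print(toy_map)
```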
### Combining and plotting LE data from the year 2000
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
world_data <- map_data('world')
# Keep every map polygon; countries without 2000 data receive NA and are drawn in grey
combined_2000 <- world_data
# Attach the 2000 life expectancy to each polygon row by matching country names
combined_2000$value <- two_thousand$Life.expectancy[match(combined_2000$region, two_thousand$Country)]
ggplot(data=combined_2000, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2000)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
### Combining and plotting LE data from the year 2005
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
two_thousand_five <- subset(x=data_world_map, data_world_map$Year==2005)
# Keep every map polygon; countries without 2005 data receive NA and are drawn in grey
combined_2005 <- world_data
# Attach the 2005 life expectancy to each polygon row by matching country names
combined_2005$value <- two_thousand_five$Life.expectancy[match(combined_2005$region, two_thousand_five$Country)]
ggplot(data=combined_2005, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2005)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
### Combining and plotting LE data from the year 2010
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
two_thousand_ten <- subset(x=data_world_map, data_world_map$Year==2010)
# Keep every map polygon; countries without 2010 data receive NA and are drawn in grey
combined_2010 <- world_data
# Attach the 2010 life expectancy to each polygon row by matching country names
combined_2010$value <- two_thousand_ten$Life.expectancy[match(combined_2010$region, two_thousand_ten$Country)]
ggplot(data=combined_2010, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2010)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
### Combining and plotting LE data from the year 2015
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
two_thousand_fifteen <- subset(x=data_world_map, data_world_map$Year==2015)
# Keep every map polygon; countries without 2015 data receive NA and are drawn in grey
combined_2015 <- world_data
# Attach the 2015 life expectancy to each polygon row by matching country names
combined_2015$value <- two_thousand_fifteen$Life.expectancy[match(combined_2015$region, two_thousand_fifteen$Country)]
ggplot(data=combined_2015, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2015)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
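The four yearly blocks above differ only in the year they select, so the joining step could be factored into one helper. This is a rough sketch under stated assumptions: `attach_le` is a hypothetical function, and `toy_map` and `toy_le` are made-up stand-ins for `map_data('world')` and `data_world_map` so that the example runs without any mapping packages.

```r
# Hypothetical helper: attach one year's life expectancy to every map row.
# Regions with no matching country keep NA in `value`.
attach_le <- function(map_df, le_df, year) {
  le_year <- le_df[le_df$Year == year, ]
  idx <- match(map_df$region, as.character(le_year$Country))
  map_df$value <- le_year$Life.expectancy[idx]
  map_df
}

# Toy stand-ins with made-up numbers (not values from the WHO dataset)
toy_map <- data.frame(region = c("A", "A", "B", "C"),
                      long = c(0, 1, 2, 3),
                      lat  = c(0, 1, 2, 3))
toy_le <- data.frame(Country = c("A", "B", "A", "B"),
                     Year = c(2000, 2000, 2005, 2005),
                     Life.expectancy = c(60, 70, 62, 71))

# One joined data frame per year, ready to be passed to ggplot()
years <- c(2000, 2005)
combined <- lapply(years, function(y) attach_le(toy_map, toy_le, y))
names(combined) <- as.character(years)
print(combined[["2005"]])
```

With the real data, each element of `combined` could be passed to the same `ggplot()` call, with only the title changing between years.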
### Subsetting and Pre-processing for Mean Life Expectancy
loadPkg("dplyr")
data_le_means <- data.frame(data)
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'USA')
data_le_means$Country[data_le_means$Country == 'United States of America'] <- 'USA'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'UK')
data_le_means$Country[data_le_means$Country == 'United Kingdom of Great Britain and Northern Ireland'] <- 'UK'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Russia')
data_le_means$Country[data_le_means$Country == 'Russian Federation'] <- 'Russia'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Bolivia')
data_le_means$Country[data_le_means$Country == 'Bolivia (Plurinational State of)'] <- 'Bolivia'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Brunei')
data_le_means$Country[data_le_means$Country == 'Brunei Darussalam'] <- 'Brunei'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Czech Republic')
data_le_means$Country[data_le_means$Country == 'Czechia'] <- 'Czech Republic'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'North Korea')
data_le_means$Country[data_le_means$Country == 'Democratic People\'s Republic of Korea'] <- 'North Korea'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Iran')
data_le_means$Country[data_le_means$Country == 'Iran (Islamic Republic of)'] <- 'Iran'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Laos')
data_le_means$Country[data_le_means$Country == 'Lao People\'s Democratic Republic'] <- 'Laos'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Micronesia')
data_le_means$Country[data_le_means$Country == 'Micronesia (Federated States of)'] <- 'Micronesia'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'South Korea')
data_le_means$Country[data_le_means$Country == 'Republic of Korea'] <- 'South Korea'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Moldova')
data_le_means$Country[data_le_means$Country == 'Republic of Moldova'] <- 'Moldova'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Saint Vincent')
data_le_means$Country[data_le_means$Country == 'Saint Vincent and the Grenadines'] <- 'Saint Vincent'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Syria')
data_le_means$Country[data_le_means$Country == 'Syrian Arab Republic'] <- 'Syria'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Macedonia')
data_le_means$Country[data_le_means$Country == 'The former Yugoslav republic of Macedonia'] <- 'Macedonia'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Tanzania')
data_le_means$Country[data_le_means$Country == 'United Republic of Tanzania'] <- 'Tanzania'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Venezuela')
data_le_means$Country[data_le_means$Country == 'Venezuela (Bolivarian Republic of)'] <- 'Venezuela'
levels(data_le_means$Country) <- c(levels(data_le_means$Country), 'Vietnam')
data_le_means$Country[data_le_means$Country == 'Viet Nam'] <- 'Vietnam'
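# Rows with a missing life expectancy value
le_na <- subset(data_le_means, is.na(Life.expectancy))
# Keep only the rows where life expectancy was recorded
le_clean <- subset(data_le_means, !is.na(Life.expectancy))
# Mean life expectancy per country across 2000-2015, keeping development status
country_and_status <- le_clean %>% group_by(Country, Status) %>% summarise(mean_le = mean(Life.expectancy))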
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
world_data <- map_data('world')
# Keep every map polygon; countries without data receive NA and are drawn in grey
combined_means <- world_data
# Attach the 2000-2015 mean life expectancy to each polygon row by matching country names
combined_means$value <- country_and_status$mean_le[match(combined_means$region, country_and_status$Country)]
ggplot(data=combined_means, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Mean Life Expectancy Across the World (2000-2015)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
In summary, life expectancy was found to be significantly correlated with all numeric variables except population, and the average value of each of these variables differed between developed and developing countries. This indicates that life expectancy is related to the status of a country, that is, whether it is developed or developing. Developing countries had an average life expectancy of 67.7 years and developed countries an average of 78.7 years. Although all numeric variables except population size were significantly correlated with life expectancy, income level, years of schooling, BMI, and prevalence of HIV/AIDS had the strongest correlations, with Pearson’s r values of 0.73, 0.75, 0.57, and -0.56, respectively. This indicates that the higher a country's income level, level of education, and BMI, the longer its life expectancy. It also indicates that the lower the prevalence of HIV/AIDS, the longer the life expectancy, as one would expect. Thus, the answer to our SMART question is that the factors found to be significantly associated with life expectancy across the world include adult mortality rate, infant deaths, alcohol, percentage expenditure, hepatitis B cases, measles cases, BMI, under-five deaths, polio cases, total expenditure, diphtheria cases, HIV/AIDS cases, GDP, thinness in 1-19 year-olds and 5-9 year-olds, income, and schooling. Throughout our exploratory data analysis, this answer stayed consistent across each graph we analyzed and each test we conducted. We found that health factors are clearly related to life expectancy and identified the factors that correlate most strongly with it, which should be the most reliable to use when creating models. For future analysis and model building, it will be important to clean the data thoroughly and identify all instances of incorrect values.
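For readers who want to reproduce this kind of summary, the sketch below shows how a Pearson correlation with its significance test, and a mean by development status, can be computed in R. The data frame `toy` contains made-up illustrative values, not values from the WHO dataset, so its output does not correspond to the figures reported above.

```r
# Made-up illustrative values, NOT taken from the WHO dataset
toy <- data.frame(
  Status = c("Developing", "Developing", "Developing",
             "Developed", "Developed", "Developed"),
  Schooling = c(8, 10, 11, 14, 15, 16),
  Life.expectancy = c(58, 66, 64, 77, 80, 79)
)

# Pearson's r together with a test of the null hypothesis r = 0
ct <- cor.test(toy$Schooling, toy$Life.expectancy, method = "pearson")
r <- unname(ct$estimate)

# The same coefficient from its definition:
# covariance divided by the product of the standard deviations
r_manual <- cov(toy$Schooling, toy$Life.expectancy) /
  (sd(toy$Schooling) * sd(toy$Life.expectancy))

# Mean life expectancy for each development status
means_by_status <- aggregate(Life.expectancy ~ Status, data = toy, FUN = mean)

print(r)
print(ct$p.value)
print(means_by_status)
```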
K. Rajarshi. Life Expectancy (WHO). Kaggle, 2018. https://www.kaggle.com/kumarajarshi/life-expectancy-who
Global Health Observatory (GHO) data: Life Expectancy. WHO. https://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends_text/en/